This notebook is a prototype of the full pipeline used in this project. It provides an example of the entire pipeline, which is currently spread over multiple notebooks, covering every step, including:
There are three files that are generated at this stage. All consist of feature vectors corresponding to protein pairs in the following three sets:
This pipeline will take advantage of a prepared list of interactions to build the network rather than trying to classify all possible combinations of the pulldown proteins.
To run the assembler, a data source table must first be generated. This table describes the data sources to be used to build feature vectors. In this case, we will be using simple features with high coverage to avoid dealing with the missing data problem. This task is also performed in the notebook on training and test set generation.
This table must be placed in the top directory of the feature data:
In [9]:
cd ../..
In [5]:
import csv
In [4]:
!git annex unlock datasource.proto.tab
In [5]:
f = open("datasource.proto.tab","w")
c = csv.writer(f,delimiter="\t")
# the HIPPIE feature
c.writerow(["HIPPIE/hippie_current.txt","HIPPIE/feature.HIPPIE.db","protindexes=(1,3);valindexes=(4);zeromissing=1"])
# Gene Ontology features
c.writerow(["Gene_Ontology","Gene_Ontology","generator=geneontology/testgen.pickle"])
# iRefIndex features
c.writerow(["iRefIndex","iRefIndex","generator=iRefIndex/human.iRefIndex.Entrez.1ofk.pickle"])
# STRING features
c.writerow(["STRING","STRING","generator=string/human.Entrez.string.pickle"])
# ENTS summary feature
c.writerow(['ENTS_summary','ENTS_summary','generator=ents/human.Entrez.ENTS.summary.pickle'])
f.close()
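As a quick sanity check (just a sketch, reading back the file written above), the generated table can be printed to confirm that each row lists a source file, a feature database path and an options string:
In [ ]:
# sketch: read the data source table back and print each row
with open("datasource.proto.tab") as f:
    for row in csv.reader(f, delimiter="\t"):
        print row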
In [4]:
import sys
In [7]:
sys.path.append("opencast-bio/")
In [8]:
import ocbio.extract
In [13]:
reload(ocbio.extract)
Out[13]:
In [10]:
!git annex unlock HIPPIE/feature.HIPPIE.db
In [14]:
assembler = ocbio.extract.FeatureVectorAssembler("datasource.proto.tab", verbose=True)
At this point the assembler.regenerate method could be used to make sure the database is up to date, but since there is only one database and it is known to be up to date, this is unnecessary.
Next, we use the assembler.assemble method to create the feature vectors, supplying it with a list of Entrez Gene ID pairs to match to these vectors:
In [20]:
assembler.assemble("DIP/human/training.nolabel.positive.Entrez.txt",
"features/training.nolabel.proto.positive.Entrez.vectors.txt", verbose=True)
In [15]:
assembler.assemble("DIP/human/training.nolabel.negative.Entrez.txt",
"features/training.nolabel.proto.negative.Entrez.vectors.txt", verbose=True)
In [16]:
assembler.assemble("forGAVIN/mergecode/edgelist_new.txt",
"features/edgelist.proto.vectors.txt", verbose=True)
Now that the feature vectors have been written, we will use the training vectors to train a classifier which we can apply to the vectors written to edgelist.proto.vectors.txt.
We will be using an ensemble classifier to slightly improve performance over a simple classifier such as logistic regression.
The chosen classifier, as stated in the proposal, is a random forest, as it was shown to have superior performance in Qi's 2006 paper.
The task of protein interaction prediction as approached here is a supervised binary classification problem. What do those words actually mean?
A random forest classifier is a combination of a large number of generated decision trees. Decision trees are intuitive to understand; the name practically describes what they are. At each node in the tree one of the features is tested and a decision is made about which feature to check next or which output to provide:
Random forests simply generate a large number of these trees with varying structures and then combine the results.
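As a rough illustration of this idea (a sketch on synthetic toy data, not the project data; the forest below is hand-rolled purely to show the bootstrap-and-vote principle), each tree is fitted to a bootstrap sample of the rows and the per-tree votes are averaged:
In [ ]:
# sketch: a hand-rolled "forest" of decision trees on toy data
import numpy as np
import sklearn.tree

# toy dataset: 100 examples, 5 features, label depends on the first two features
Xtoy = np.random.rand(100, 5)
ytoy = (Xtoy[:, 0] + Xtoy[:, 1] > 1.0).astype(int)

trees = []
for i in range(10):
    # bootstrap sample of the rows; random subset of features considered per split
    sample = np.random.randint(0, 100, 100)
    tree = sklearn.tree.DecisionTreeClassifier(max_features="sqrt")
    trees.append(tree.fit(Xtoy[sample], ytoy[sample]))

# average the votes of the individual trees to get the ensemble prediction
votes = np.mean([tree.predict(Xtoy) for tree in trees], axis=0)
print votes[:10]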
The training data is a sparse dataset in that it contains a very large number of zeros.
This can be efficiently represented using a SciPy sparse matrix.
The CSV files can be loaded using Python's csv module:
In [17]:
import csv
import numpy as np
import scipy.sparse
At the moment we have to read off the numbers of rows and columns from the assembler output above to initialise these matrices.
In [80]:
#initialising matrices
posrows,poscolumns = 4857,111
negrows,negcolumns = 3061800,111
posvectors = scipy.sparse.lil_matrix((posrows,poscolumns))
negvectors = scipy.sparse.lil_matrix((negrows,negcolumns))
In [81]:
# loading positive vectors
f = open("features/training.nolabel.proto.positive.Entrez.vectors.txt")
c = csv.reader(f, delimiter="\t")
lcount = 0
for line in c:
    posvectors[lcount, :] = np.array(line, dtype=float)
    lcount += 1
f.close()
In [82]:
# loading negative vectors
f = open("features/training.nolabel.proto.negative.Entrez.vectors.txt")
c = csv.reader(f, delimiter="\t")
lcount = 0
for line in c:
    negvectors[lcount, :] = np.array(line, dtype=float)
    lcount += 1
f.close()
At this point there are two large matrices, posvectors and negvectors. As the names suggest, they are the positive and negative feature vectors.
Scikit-learn requires binary labels for these vectors. To provide them, we create a vector of labels with the same length as these two matrices combined.
These are then shuffled to ensure that the training and test sets both contain a random mixture of positive and negative examples.
This task is also shown in the notebook on training the classifier.
When training a classifier there is usually a distinction between the training and test sets. The difference between them is that the training set is actually used to fit the model. To detect overfitting we evaluate this fitted model on a randomly selected test set, as testing the model on the training set will typically produce unrealistically good results. This will be demonstrated below.
In [85]:
# making the label vector
ylabels = np.array([1.0]*posvectors.shape[0] + [0.0]*negvectors.shape[0])
In [86]:
# shuffling is done using scikit-learn's shuffle
import sklearn.utils
In [88]:
# concatenate the data together and shuffle
X,y = sklearn.utils.shuffle(scipy.sparse.vstack([posvectors,negvectors]),ylabels)
In [90]:
# find the quarter mark of the dataset
quarter = X.shape[0] // 4
# split into training (3/4) and test (1/4) sets
Xtrain, Xtest = X[:3*quarter], X[3*quarter:]
ytrain, ytest = y[:3*quarter], y[3*quarter:]
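As a quick check (a sketch using the arrays defined above), the shuffle should leave the training and test sets with similar proportions of positive examples:
In [ ]:
# sketch: fraction of positive labels in each split should be roughly equal
print "train positive fraction: {0:.4f}".format(ytrain.mean())
print "test positive fraction:  {0:.4f}".format(ytest.mean())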
The bulk of this work is done inside Scikit-learn's random forest classifier.
All classifiers in Scikit-learn have a similar interface.
First, the classifier is instantiated with the relevant options. In this case that is n_estimators, which is the number of trees to use in the forest.
Second, the fit method is used to fit the model. The training set supplies the X and y arguments.
In [91]:
import sklearn.ensemble
In [93]:
randomforest = sklearn.ensemble.RandomForestClassifier(n_estimators=10)
In [108]:
randomforest.fit(Xtrain.todense(),ytrain)
Out[108]:
After we have fit the model, typically we want to know whether our model fits the data well or not. In the task of classification we would like to get an idea of the error rate and accuracy we can expect when we apply this model to our problem. A typical tool to quickly describe the sensitivity and specificity of a binary classifier is a ROC curve.
ROC stands for Receiver Operating Characteristic. It is a plot of the true positive rate against the false positive rate for a classifier as the classification threshold is varied. The Wikipedia page describes it in more detail if required.
We can build ROC curves running the classifier on both the test data and training data to illustrate the overfitting problem mentioned above. First, running the classifier on the training data should give us inflated results:
In [96]:
import sklearn.metrics
import matplotlib.pyplot as plt
In [120]:
def plotroc(ytest, estimates):
    # ROC curve from the positive-class probability estimates
    fpr, tpr, thresholds = sklearn.metrics.roc_curve(ytest, estimates[:, 1])
    plt.clf()
    plt.plot(fpr, tpr)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.show()
    print "Area under curve (AUC): {0}".format(sklearn.metrics.auc(fpr, tpr))
    return fpr, tpr
In [117]:
estimates = randomforest.predict_proba(Xtrain.todense())
In [123]:
trfpr,trtpr = plotroc(ytrain,estimates)
In [124]:
estimates = randomforest.predict_proba(Xtest.todense())
In [125]:
tefpr,tetpr = plotroc(ytest,estimates)
The difference in this case is actually very small, and this is likely a characteristic of the data used to train the model.
Cross validation involves taking multiple splits of the training and test data and evaluating the AUC on the test data of each. In the real pipeline this will be the preferred method of testing to avoid outlier test sets which predict unrealistic performance.
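A minimal sketch of how that could look using scikit-learn's cross_val_score, scoring each fold by AUC (the module is sklearn.cross_validation in the version used here; newer releases call it sklearn.model_selection):
In [ ]:
# sketch: 5-fold cross validation of the random forest, scored by ROC AUC
import sklearn.cross_validation  # sklearn.model_selection in newer versions

scores = sklearn.cross_validation.cross_val_score(
    sklearn.ensemble.RandomForestClassifier(n_estimators=10),
    X.todense(), y, scoring="roc_auc", cv=5)
print "AUC per fold: {0}".format(scores)
print "mean AUC: {0:.3f}".format(scores.mean())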
AUC values give a single-value figure of merit for a ROC curve. The closer the value is to 1, the closer the classifier is to being a perfect predictor for the data.
The training labels in this task are very sparsely populated, with roughly 600 negative examples for every positive one. On this kind of data, and as mentioned by Qi et al., it is more common to use a reduced AUC value, which only considers the first portion of the ROC curve. This is because in the later sections of the curve the false positive rate of the predictor is rising, and in a practical setting only low false positive rates are acceptable.
A value described in Qi et al (2006) is the R50 value. This is described as the area under the curve up until 50 false positives have been detected. In their case this corresponds to a false positive rate of 0.167%. We have 766,665 test examples, so our R50 value would correspond to a rate of 0.000652%, which is clearly impractical.
The description of this in the paper is likely poorly worded. It also claims that this value is in widespread use, but apart from in this paper it is very difficult to find any mention of it. In any case, we can find the area under the ROC curve up until a false positive rate of 0.167% is reached.
In [171]:
r50trfpr = trfpr[trfpr < 0.00167]
r50crv = np.array([r50trfpr, trtpr[:r50trfpr.shape[0]]])
print "R50 value for Training data {0}".format(sum(r50crv[1,:])/len(r50crv[1,:]))
In [174]:
r50tefpr = tefpr[tefpr < 0.00167]
r50crv = np.array([r50tefpr, tetpr[:r50tefpr.shape[0]]])
print "R50 value for Test data {0}".format(sum(r50crv[1,:])/len(r50crv[1,:]))
Or up until reaching a false positive rate of 0.000652%:
In [175]:
r50tefpr = tefpr[tefpr < 0.00000652]
r50crv = np.array([r50tefpr, tetpr[:r50tefpr.shape[0]]])
print "R50 value for Test data {0}".format(sum(r50crv[1,:])/len(r50crv[1,:]))
Interestingly, the first value of 0.631 compares well with the performance of the classifiers in the paper by Qi et al (2006).
Here, we are using a network of pulldown proteins created by merging the known interactions from three sources when queried with a list of proteins returned by a pulldown experiment. These three sources were:
This list of interactions was then hand-curated by removing some proteins with extremely high numbers of connections to aid the community detection algorithms. The resulting list of interactions defines the protein-protein interaction network of interest. Adding weights to this list of edges is a key part of this project. This will be done using the classifier trained above and the feature vectors created for this list of interactions.
In [177]:
# using loadtxt now as the file is small
pulldownvectors = np.loadtxt("features/edgelist.proto.vectors.txt")
In [227]:
cd ..
In [228]:
pulldownestimates = randomforest.predict_proba(pulldownvectors)
In [229]:
# write the results to a file
# with the associated protein pairs
f, fr = open("features/edgelist.proto.predictions.txt", "w"), open("forGAVIN/mergecode/edgelist_new.txt")
c, cr = csv.writer(f, delimiter="\t"), csv.reader(fr, delimiter="\t")
lcount = 0
for line in cr:
    c.writerow(line + ["%.5f" % pulldownestimates[lcount, 1]])
    lcount += 1
f.close()
fr.close()
Unfortunately, because there was a header in the edgelist file, the first "interaction" is between the string "columns" and the number 2. This must be removed:
In [230]:
!head -n 1 features/edgelist.proto.predictions.txt
In [231]:
!tail -n +2 features/edgelist.proto.predictions.txt > features/edgelist.proto.predictions.txt.tmp
In [232]:
!mv features/edgelist.proto.predictions.txt.tmp features/edgelist.proto.predictions.txt
The next step is to apply a community detection algorithm to both the weighted and unweighted edge lists. This algorithm has already been written in C++, so all that needs to be done is to execute it as in the notes on community detection.
First, it must be compiled:
In [10]:
cd HBP/
In [201]:
!make
In [206]:
!./run -file ../features/edgelist.proto.predictions.txt -weighted 0 -algorithm 2 -quite 1
The output is always written to the OUT directory, so it is moved before running again.
In [207]:
!mv OUT protounweighted
In [13]:
!mkdir OUT
In [14]:
!./run -file ../features/edgelist.proto.predictions.txt -weighted 1 -algorithm 3 -quite 1
In [15]:
!mv OUT protoweighted
The above warning may be a problem: Warning: Too many iterations in tqli. This will have to be asked about.
The results using weighted and unweighted edges can be compared in terms of their similarity or in terms of their disease enrichment results. The similarity between the results can be summarised by Normalised Mutual Information (NMI), which returns a value between 0 and 1 indicating how similar the two sets of communities are. Results of the disease enrichment test require interpretation by eye.
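For reference, the form of normalised mutual information that the function defined below appears to implement is (with $n_{ij}$ the number of nodes placed in community $i$ by one method and community $j$ by the other, $n_i$ and $n_j$ the corresponding community sizes, and $N$ the total number of nodes):

$$\mathrm{NMI} = \frac{-2\sum_{i}\sum_{j} n_{ij}\log\!\left(\frac{n_{ij}\,N}{n_i\,n_j}\right)}{\sum_{i} n_i\log\!\left(\frac{n_i}{N}\right) + \sum_{j} n_j\log\!\left(\frac{n_j}{N}\right)}$$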
Unfortunately, there are problems running the code below at the moment; this needs looking into.
When running the above community detection code, the communities found are compared using a disease enrichment test which indicates the likelihood that a given community is involved in a given disease. By default this is targeted at a specific set of markers for Schizophrenia. Looking at the differences between these two results will take place in a more detailed treatment of the community detection part of this project, and will probably be found in the community detection notebook.
In [1]:
from __future__ import division
import collections
import math
In [2]:
#--- helper function
def convertStr(s):
    ret = int(s)
    return ret
In [22]:
import pdb
In [23]:
def NMI(fname1,fname2):
    rows, new, dic = [], [], []
    consensus = fname1
    community = fname2
    #--- read in first community file
    with open(consensus, 'r') as csvfile:
        reader = csv.reader(csvfile, delimiter='=')
        headerline = reader.next()
        for row in reader:
            new.append(row)
    #--- read in second community file
    with open(community, 'r') as csvfile:
        reader = csv.reader(csvfile, delimiter='=')
        headerline = reader.next()
        for row in reader:
            dic.append(row)
    for i in dic:
        for j in new:
            if i[0] == j[0]:
                i.append(j[1])
    pdb.set_trace()
    dic.sort(key=lambda x: int(x[2]))
    #--- count of nodes in each community in reality and in algorithm
    realc, algc = [], []
    for i in dic:
        realc.append(convertStr(i[1]))
        algc.append(convertStr(i[2]))
    comcount = dict([(i,realc.count(i)) for i in set(realc)])
    pcount = dict([(i,algc.count(i)) for i in set(algc)])
    #--- holds the algorithm community label with the corresponding community labels
    #    provided e.g., {1:{2:2,3:3}}
    algcount, tempd = {}, {}
    templist = []
    comtrack = 0
    for key, val in pcount.iteritems():
        templist = realc[comtrack:comtrack+val]
        comtrack += val
        for i in set(templist):
            tempd[i] = templist.count(i)
        algcount[key] = tempd
        tempd = {}
    #--- finding the node with the maximum label in that community
    #    key = int, value = dictionary
    label = {}
    for key, value in algcount.iteritems():
        label[key] = max(value, key=value.get)
    for i in dic:
        for j in label:
            if int(i[2]) == j:
                i.append(label[j])
    #--- METRICS
    #--- Calculate the NMI
    nt, np, ntp = {}, {}, {}
    NMI = 0
    nodes = len(dic)
    for i in range(1, max(comcount)+1):
        for j in range(1, max(pcount)+1):
            nt[i] = 0
            np[j] = 0
            ntp[(str(i)+' '+str(j))] = 0
    nt = comcount
    np = pcount
    for i in dic:
        ntp[(str(i[1]).strip()+' '+str(i[2]).strip())] += 1
    num, den = 0, 0
    for i, t in nt.iteritems():
        for j, p in np.iteritems():
            temp, temp2 = 0, 0
            if (ntp[(str(i).strip()+' '+str(j).strip())] > 0) and (t > 0) and (p > 0):
                temp = ((ntp[(str(i).strip()+' '+str(j).strip())]*nodes)/(t*p))
                temp2 = ntp[(str(i).strip()+' '+str(j).strip())]*math.log(temp)
            num += temp2
    d1, d2 = 0, 0
    for t in nt.values():
        d1 += t * math.log(t/nodes)
    for p in np.values():
        d2 += p * math.log(p/nodes)
    den = d1 + d2
    NMI = -(2*num)/den
    #--- print results to screen
    print 'NMI = %s' % (NMI)
    f = open('OUT/community-metric_nmi.txt', 'w')
    f.writelines("node\treal\talg\tnewlabel\n")
    f.writelines("%s\t%s\t%s\t%s\n" % (i[0],i[1],i[2],i[3]) for i in dic)
    f.writelines('NMI = %s' % (NMI))
    f.close()
    return None